Red Wine Quality analysis by Harish Garg

Univariate Plots Section

Plots

## [1] "Summary of fixed.acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

## [1] "Summary of volatile.acidity"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

## [1] "Summary of citric.acid"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

## [1] "Summary of residual.sugar"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

## [1] "Summary of chlorides"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

## [1] "Summary of free.sulfur.dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

## [1] "Summary of total.sulfur.dioxide"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

## [1] "Summary of density"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

## [1] "Summary of pH"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

## [1] "Summary of sulphates"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

## [1] "Summary of alcohol"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

## [1] "Summary of Quality"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
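
The code behind these summaries is not shown in the report. As a minimal sketch, output of this shape could come from a loop that prints a label and then calls `summary()` on each column. The data file is not available here, so the sketch rebuilds only the quality column from the counts printed above. The data frame name `wine` is a placeholder.

```r
# Rebuild the quality column from the counts printed above
# (scores 3..8 occurring 10, 53, 681, 638, 199 and 18 times).
# `wine` is a placeholder name; the real data frame has 12 more columns.
wine <- data.frame(
  quality = rep(3:8, times = c(10, 53, 681, 638, 199, 18))
)

# Print a labelled summary (quartiles plus mean) for every column.
for (v in names(wine)) {
  print(paste("Summary of", v))
  print(summary(wine[[v]]))
}

# Frequency of each quality score.
print(table(wine$quality))
```

With the full dataset loaded into `wine`, the same loop would cover all columns.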

Observations

Below are some of the observations on the above plots:

  • density and pH seem to have a normal distribution.

  • residual.sugar, chlorides, and sulphates seem to have a long tail on the positive side.

  • fixed.acidity, volatile.acidity, citric.acid, free.sulfur.dioxide, total.sulfur.dioxide and alcohol seem to be moderately right-skewed, with a shape roughly resembling a Poisson distribution.
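
The shapes described above can be cross-checked against the printed summaries without the plots. The sketch below uses a rough heuristic that is an assumption of this example, not a formal test and not necessarily what the author did. It divides the distance from the median to the maximum by the distance from the minimum to the median. The column names `lo`, `mid` and `hi` are placeholders.

```r
# Min, median and max copied from the summaries printed above.
stats <- data.frame(
  variable = c("density", "pH", "alcohol", "sulphates",
               "chlorides", "residual.sugar"),
  lo  = c(0.9901, 2.74,  8.40, 0.33, 0.012,  0.90),
  mid = c(0.9968, 3.31, 10.20, 0.62, 0.079,  2.20),
  hi  = c(1.0040, 4.01, 14.90, 2.00, 0.611, 15.50)
)

# Tail-asymmetry ratio: about 1 suggests a symmetric shape,
# values well above 1 suggest a long right tail.
stats$tail_ratio <- with(stats, (hi - mid) / (mid - lo))

# Show variables from most symmetric to most right-tailed.
print(stats[order(stats$tail_ratio), c("variable", "tail_ratio")], digits = 3)
```

A ratio near 1 for density and pH is consistent with the roughly normal shape noted above, while residual.sugar and chlorides stand out with much larger ratios.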

Univariate Analysis

What is the structure of your dataset?

This dataset has 1599 observations and 13 variables. These 1599 observations correspond to 1599 types of red wines.

What is/are the main feature(s) of interest in your dataset?

  • “quality” is the dependent variable.
  • The remaining 12 variables are independent variables. We will look at how these 12 independent variables relate to the dependent variable, i.e. quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Let’s begin by finding the correlation between each independent variable and the dependent variable.

##                    X        fixed.acidity     volatile.acidity 
##                0.066                0.124                0.391 
##          citric.acid       residual.sugar            chlorides 
##                0.226                0.014                0.129 
##  free.sulfur.dioxide total.sulfur.dioxide              density 
##                0.051                0.185                0.175 
##                   pH            sulphates              quality 
##                0.058                0.251                1.000

The results suggest that none of the independent variables has a strong correlation with quality. So, we would need to work with multiple independent variables to see if we get a stronger relationship with quality.
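
The correlation code itself is not shown. All reported values are non-negative, which suggests absolute values were taken, though that is an assumption. The sketch below defines a hypothetical helper, `abs_cor_with_target`, that computes such values for any data frame, and then ranks the figures reported above after re-entering them by hand.

```r
# Absolute Pearson correlation of each numeric column with a target column.
# `df` and `target` are placeholder names for this sketch.
abs_cor_with_target <- function(df, target = "quality") {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors <- setdiff(numeric_cols, target)
  sapply(df[predictors], function(col) abs(cor(col, df[[target]])))
}

# Values from the output above, re-entered by hand (quality itself omitted).
reported <- c(
  X = 0.066, fixed.acidity = 0.124, volatile.acidity = 0.391,
  citric.acid = 0.226, residual.sugar = 0.014, chlorides = 0.129,
  free.sulfur.dioxide = 0.051, total.sulfur.dioxide = 0.185,
  density = 0.175, pH = 0.058, sulphates = 0.251
)

# Rank from strongest to weakest.
ranked <- sort(reported, decreasing = TRUE)
print(ranked)
```

The ranking puts volatile.acidity first, followed by sulphates and citric.acid, and every value stays below 0.4.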

Did you create any new variables from existing variables in the dataset?

Not yet. I may update this section if I create more variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

We are going to turn the quality variable into a factor, as this will let us treat the task as a classification problem.
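
A minimal sketch of the conversion, using quality scores rebuilt from the frequency table shown earlier. Making the factor ordered is a choice made for this example, since the text does not say whether an ordered or unordered factor was used.

```r
# Quality scores rebuilt from the frequency table shown earlier.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Ordered factor: keeps the ranking 3 < 4 < ... < 8.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

print(levels(quality_f))
print(table(quality_f))

# Caution: as.integer() on a factor returns level codes (1..6),
# not the original scores. Go through as.character() to recover them.
codes  <- as.integer(quality_f)
scores <- as.integer(as.character(quality_f))
print(range(codes))
print(range(scores))
```

The level-code pitfall matters if the factor is later converted back to numbers, for example to compute a correlation.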

Bivariate Plots Section

Plots

##                          X fixed.acidity volatile.acidity citric.acid
## X                    1.000         0.268            0.009       0.154
## fixed.acidity        0.268         1.000            0.256       0.672
## volatile.acidity     0.009         0.256            1.000       0.552
## citric.acid          0.154         0.672            0.552       1.000
## residual.sugar       0.031         0.115            0.002       0.144
## chlorides            0.120         0.094            0.061       0.204
## free.sulfur.dioxide  0.090         0.154            0.011       0.061
## total.sulfur.dioxide 0.118         0.113            0.076       0.036
## density              0.368         0.668            0.022       0.365
## pH                   0.136         0.683            0.235       0.542
## sulphates            0.125         0.183            0.261       0.313
##                      residual.sugar chlorides free.sulfur.dioxide
## X                             0.031     0.120               0.090
## fixed.acidity                 0.115     0.094               0.154
## volatile.acidity              0.002     0.061               0.011
## citric.acid                   0.144     0.204               0.061
## residual.sugar                1.000     0.056               0.187
## chlorides                     0.056     1.000               0.006
## free.sulfur.dioxide           0.187     0.006               1.000
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201               0.022
## pH                            0.086     0.265               0.070
## sulphates                     0.006     0.371               0.052
##                      total.sulfur.dioxide density    pH sulphates
## X                                   0.118   0.368 0.136     0.125
## fixed.acidity                       0.113   0.668 0.683     0.183
## volatile.acidity                    0.076   0.022 0.235     0.261
## citric.acid                         0.036   0.365 0.542     0.313
## residual.sugar                      0.203   0.355 0.086     0.006
## chlorides                           0.047   0.201 0.265     0.371
## free.sulfur.dioxide                 0.668   0.022 0.070     0.052
## total.sulfur.dioxide                1.000   0.071 0.066     0.043
## density                             0.071   1.000 0.342     0.149
## pH                                  0.066   0.342 1.000     0.197
## sulphates                           0.043   0.149 0.197     1.000

Observations

  • Variables that don’t change much with quality - fixed.acidity, residual.sugar, chlorides
  • Variables that decrease as the quality gets higher - volatile.acidity, density, pH
  • Variables that increase as the quality gets higher - citric.acid, sulphates, alcohol
  • There is a strong relationship between pH and fixed.acidity (0.683)
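
The strongest pair can be picked out of a correlation matrix programmatically rather than by eye. The sketch below uses a hypothetical helper, `strongest_pair`, and a 4 x 4 excerpt of the matrix above with values copied by hand. It assumes a symmetric matrix of absolute correlations with row and column names.

```r
# Excerpt of the matrix printed above (values copied by hand).
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
m <- matrix(c(
  1.000, 0.672, 0.668, 0.683,
  0.672, 1.000, 0.365, 0.542,
  0.668, 0.365, 1.000, 0.342,
  0.683, 0.542, 0.342, 1.000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Return the pair of distinct variables with the largest entry.
strongest_pair <- function(cor_mat) {
  # Blank the diagonal and lower triangle so each pair is seen once.
  cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA
  pos <- which(cor_mat == max(cor_mat, na.rm = TRUE), arr.ind = TRUE)[1, ]
  list(
    pair  = c(rownames(cor_mat)[pos[["row"]]], colnames(cor_mat)[pos[["col"]]]),
    value = cor_mat[pos[["row"]], pos[["col"]]]
  )
}

best <- strongest_pair(m)
print(best)
```

On the full data, the same helper could be applied to the complete matrix, for instance the absolute value of `cor()` over the numeric columns.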

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

volatile.acidity, density, pH, citric.acid, sulphates and alcohol values change as the quality changes.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

What was the strongest relationship you found?

The strongest relationship is between pH and fixed.acidity (0.683).

Multivariate Plots Section

Plots

Observations

Here, we are trying to find the right combinations of variables to distinguish high-quality wines from low-quality ones. We discuss the analysis below.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The following pairs of variables, taken together, seem to have an interesting relationship for distinguishing between higher quality and lower quality wines:

  • alcohol & chlorides
  • alcohol & volatile.acidity
  • alcohol & sulphates
  • sulphates & volatile.acidity
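
Comparing higher and lower quality wines needs a definition of the two groups, which the text does not give. The sketch below shows one possible grouping with `cut()`. The cut points (3-4 low, 5-6 medium, 7-8 high) are an assumption made for illustration. Quality scores are rebuilt from the frequency table shown earlier.

```r
# Quality scores rebuilt from the frequency table shown earlier.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Assumed cut points: (2,4] = low, (4,6] = medium, (6,8] = high.
quality_group <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("low", "medium", "high")
)

# Size of each group.
group_sizes <- table(quality_group)
print(group_sizes)
```

Under these cut points the low and high groups are small compared with the medium group, so any pattern seen in a two-variable plot for those groups rests on relatively few wines.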

Were there any interesting or surprising interactions between features?

No, I didn’t see any worth mentioning.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

A very high number of wines (more than four-fifths: 1319 of 1599) have a quality of 5 or 6.
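
The four-fifths figure can be checked directly from the quality counts reported in the univariate section. A short sketch with the counts entered by hand:

```r
# Counts per quality score, copied from the frequency table.
counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of wines rated 5 or 6.
share_5_or_6 <- sum(counts[c("5", "6")]) / sum(counts)

print(round(share_5_or_6, 3))
print(share_5_or_6 > 4 / 5)
```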

Plot Two

Description Two

As the wine quality increases, the median values of sulphates, alcohol & citric.acid increase and the median values of volatile.acidity, density & pH decrease.
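
Medians per quality level, as described here, can be computed with `aggregate()`. The helper name `median_by_quality` is hypothetical. The six rows in `toy` are made-up placeholder values, not taken from the wine dataset, and are there only so the sketch runs on its own.

```r
# Median of the chosen columns within each level of a grouping column.
median_by_quality <- function(df, vars, group = "quality") {
  aggregate(df[vars], by = setNames(list(df[[group]]), group), FUN = median)
}

# Made-up placeholder rows, NOT values from the wine dataset.
toy <- data.frame(
  quality          = c(5, 5, 5, 7, 7, 7),
  alcohol          = c(9.0, 9.5, 10.0, 11.0, 11.5, 12.0),
  volatile.acidity = c(0.6, 0.7, 0.8, 0.3, 0.4, 0.5)
)

result <- median_by_quality(toy, c("alcohol", "volatile.acidity"))
print(result)
```

With the real data frame in place of `toy`, the result has one row per quality score, which makes the increasing and decreasing trends easy to read off.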

Plot Three

Description Three

Combinations of two variables give us a little insight into the difference between wines of higher and lower quality.


Reflection

I started by printing some basic summaries of the dataset and plotting the univariate plots. The dataset looked good for analysis so far. However, once I reached the bivariate analysis and tried to find the impact that any of the independent variables has on quality, no clear winner emerged, which was quite discouraging in the beginning.

In conclusion, we started by trying to find variables that, individually or in combination, influence the quality of the wine. We conclude that no single variable can predict the quality of the wine on its own. We would need to use multiple variables and do more analysis, probably with more data.